Abstract —Supralinear and sublinear pre-synaptic and dendritic
integration is considered to be responsible for the nonlinear
computational power of biological neurons, emphasizing the role
of nonlinear integration as opposed to nonlinear output
thresholding. How, why, and to what degree the transfer function
nonlinearity helps biologically inspired neural network models is
not fully understood. Here, we study these questions in the context
of echo state networks (ESN). ESN is a simple neural network
architecture in which a fixed recurrent network is driven with
an input signal, and the output is generated by a readout layer
from the measurements of the network states. ESN architecture
enjoys efficient training and good performance on certain
signal-processing tasks, such as system identification and time
series prediction. ESN performance has been analyzed with respect
to the connectivity pattern in the network structure and the input
bias. However, the effects of the transfer function in the network
have not been studied systematically. Here, we use an approach
based on the Taylor expansion of a frequently used transfer
function, the hyperbolic tangent function, to systematically study
the effect of increasing nonlinearity of the transfer function on the
memory, nonlinear capacity, and signal processing performance
of ESN. Interestingly, we find that a quadratic approximation is
enough to capture the computational power of ESN with the tanh
function. The results of this study apply to both software and
hardware implementations of ESN.

I. Introduction 

McCulloch and Pitts [1] showed that the computational
power of the brain can be understood and modeled at the
level of a single neuron. Their simple model of the neuron
consisted of linear integration of synaptic inputs followed
by a threshold nonlinearity. Current understanding of neural
information processing reveals that the role of a single neuron
in processing input is much more complicated than a linear
integration-and-threshold process [2]. In fact, the morphology
and physiology of the synapses and dendrites create important
nonlinear effects on the spatial and temporal integration of
synaptic input into a single membrane potential [3]. Moreover,
dendritic input integration in certain neurons may adaptively
switch between supralinear and sublinear regimes [4]. From
a theoretical standpoint, this nonlinear integration is directly
responsible for the ability of neurons to classify linearly
inseparable patterns [5]. The advantage of nonlinear processing
at the level of a single neuron has also been discussed in the
artificial neural network (ANN) community [6].

Historically, the ANN community has been concerned with 
algorithms for finding the correct interaction pattern between 
neurons for a specific task 0-0- Some work in the field
has emphasized the importance of suitable collective behavior 
of the neural network facilitated by macroscopic parameters 
over microscopic degrees of freedom. Dominey et al. m 
proposed a simple model for the context-dependent motor 
control of eyes. In this model, the prefrontal cortex represents 
a suitable high-dimensional mapping of visual input that is 
adaptively projected onto basal ganglia, which in turn control 
the eye movement. The only task-dependent learning in this 
model occurs in the projection layer. This model has also been 
used to explain higher-level cognitive tasks such as grammar
comprehension in the brain 03-

More abstract versions of this model, Liquid State Machines on
and Echo State Networks [12], [13], were later introduced in
the neural network community and were subsequently unified
under the name reservoir computing (RC) 03-
In RC, an easily tunable high-dimensional recurrent network,
called the reservoir, is driven by an input signal. An adaptive
readout layer then combines the reservoir states to produce a
desired output. Figure 1 provides a conceptual illustration of
RC. ESN implements this idea with a discrete-time recurrent
network with linear or tanh activation functions and a linear
readout layer trained using regression. Many variations of
ESN exist and have been successfully applied to different
engineering tasks, such as time series prediction and system
identification (Bl.

Owing to its fixed recurrent connections, training an ESN is
much more efficient than training ordinary recurrent neural
networks (RNN), making it feasible to use its power in practical
applications. ESN’s power in time series processing has been
attributed to the reservoir’s memory urn m and its
high-dimensional projection of the input, which acts like a
temporal discriminant kernel [18] that is present in the critical
dynamical regime, where input perturbations in the reservoir
dynamics neither spread nor die out.

A major research direction in RC is to study how the
nonlinear dynamics of the reservoir may improve the
performance in different tasks (B), [22]. In particular, the goal
is to understand and enhance the high-dimensional nonlinear
mapping created by the reservoir dynamics. In the case of ESN
architecture, the nonlinearity of the reservoir can be ascribed
to its connectivity pattern, transfer function, and the input
bias. While there have been some studies focusing on the
effect of connectivity and bias [21], [23], the transfer function
nonlinearity has never been systematically studied, to the best
of our knowledge.

Fig. 1: Computation in an ESN. The reservoir is an excitable recurrent network with N readable output states represented by the
vector x(t). The input signal u(t) is fed into one or more points i in the reservoir with a corresponding weight ω_i, denoted by
the weight column vector ω = [ω_i].


Here, we examine what happens when we replace the tanh
function in the ESN reservoir with its partial Taylor series
expansion, varying the number of terms included. The addition
of each successive term will increase the order of nonlinearity
present in the transfer function, allowing us to gradually
interpolate between the linear and the tanh transfer functions.
In addition, we will explore the input weight scaling to study
the effect of sublinear integration on ESN performance, at each
level of nonlinearity. To control for other sources of variation,
we will restrict ourselves to the two most constrained reservoir
architectures that are known to preserve the computational
performance of the classical random reservoir, the simple cycle
reservoir (SCR) [24] and the Gaussian orthogonal reservoir
GU-

The main contribution of this work is a systematic study
of the role of the transfer function nonlinearity in the total
information processing capacity of recurrent neural networks.
Section II outlines the context and motivation of this work.
In Section III-A we review the basic ESN formulation used
in this study. In Section III-B we describe the details of our
Taylor expansion approach to quantify the degree of
nonlinearity and its impact on the performance of tanh-neuron ESN.
The experimental study on information processing properties
of ESNs with Taylor expanded transfer functions is presented
in Section IV. We first study the memory and the nonlinear
memory capacity of echo state networks with different transfer
function nonlinearity, then we evaluate the performance of
such networks on the Mackey-Glass and NARMA 10 time-series
benchmarks. In all cases, we find that the second-order
approximation of the tanh function provides all the nonlinear benefits
of the tanh, with no significant improvement to the network
performance with increasing nonlinearity. Moreover, we show
that the region of the tanh function which is usually thought of
as linear is actually very nonlinear. RC has been suggested as a
suitable signal processing framework for hardware realizations
targeting unconventional substrates [25] and ultra-low power
implementations, due to its multitasking capability, robustness
to noise and variations, and a fixed computational core [26],
ED- The results of this work can be used to simplify potential
hardware designs for RC while preserving their accuracy.


II. Background

Understanding the nature of computation and its properties
is an active subject of theoretical study in reservoir computing.
Hermans and Schrauwen ID showed that the ESN reservoir
acts as a recursive kernel that generates a high-dimensional
mapping of an input signal that can be used by the readout
layer to reconstruct a target output. Büsing et al. ED studied
the relationship between the reservoir and its performance and
found that while in continuous reservoirs the performance of
the system does not depend on the topology of the reservoir
network, coarse-graining the reservoir state makes the
dynamics and the performance of the system highly sensitive
to its topology. Verstraeten et al. [23] used a novel method
to quantify the nonlinearity of the reservoir as a function of
input weight magnitude. They used the ratio of the number
of frequencies in the input to the number of frequencies in
the dynamics of the input-driven reservoir as a proxy for
the reservoir nonlinearity. As a result of these studies, the
growing consensus is that, from a theoretical perspective, one
would obtain more nonlinear computational power in the
reservoir by adjusting the input weight magnitudes so as
to project the input onto the more nonlinear regions of the
tanh transfer function [28] (see Figure 4a). This opens an
interesting research area; however, in existing approaches the
linear and nonlinear regions of the tanh function are not defined
precisely. Moreover, there is little evidence that using the
so-called nonlinear region of the tanh actually improves the
performance on nonlinear tasks [23], [24], [28].

To illustrate the effect of using the nonlinear parts of the
tanh function, we have included a sensitivity analysis of
reservoirs with linear and tanh transfer functions for solving four
different benchmarks: linear memory, nonlinear computation
capacity, Mackey-Glass chaotic prediction, and NARMA 10
computation (see Section IV for task details and Section III-A
for the reservoir model). For memory and nonlinear computation
capacity a reservoir of N = 50 nodes was used, and for the
Mackey-Glass and NARMA 10 tasks reservoirs of N = 500
and N = 100 nodes were used, respectively. The reservoirs are
generated by sampling the standard Gaussian distribution and
are rescaled to have spectral radius λ. Input weights are drawn
from the Bernoulli distribution over {-1, +1} and multiplied
by the input weight coefficient v. The reservoir parameters v and λ
were swept on the interval [0.1, 1] with 0.1 increments and the
results were averaged over 10 runs. Figure 2 shows the results
of the sensitivity analysis. For all the tasks, the best results are
achieved for the lowest v values, which map the input signals
well within the speculated linear region of the tanh function.
In this work, our goal is to decompose the nonlinearity of the
tanh function and study its effects as a function of the degree
of nonlinearity and input strength v.

Fig. 2: Illustration of sensitivity of ESN performance to v and
λ. For linear memory and nonlinear capacity the highest values
are optimal, and for Mackey-Glass prediction and NARMA 10
computation the lowest values are optimal. The optimal values
for all tasks occur for low v, where the input signal is mapped
onto the so-called linear region of the tanh(x) function.

III. Model 

A. Echo State Network 

An ESN consists of an input-driven recurrent neural network,
which acts as the reservoir, and a readout layer that reads
the reservoir states and produces the output. Mathematically,
the input-driven reservoir is defined as follows. Let N be
the size of the reservoir. We represent the time-dependent
inputs as a column vector u(t), the reservoir state as a column
vector x(t), and the output as a column vector y(t). The input
connectivity is represented by the matrix ω and the reservoir
connectivity is represented by an N × N weight matrix Ω. For
simplicity, we assume that we have one input signal and one
output, but the notation can be extended to multiple inputs and
outputs. The time evolution of the reservoir is given by:

$$x(t+1) = f(\Omega x(t) + \omega u(t)), \tag{1}$$

where f is the transfer function of the reservoir nodes that is
applied element-wise to its operand. This function is usually
tanh or linear. The output is generated by the multiplication of
a readout weight matrix Ψ of length N + 1 and the reservoir
state vector x(t) extended by an optional constant 1, represented
by x'(t):

$$y(t) = \Psi x'(t). \tag{2}$$

Fig. 3: (a) Schematic of a linear ESN. A time-varying input
signal u(t) drives a dynamical core called a reservoir. The
states of the reservoir x(t) are combined linearly to produce
the output y(t). The reservoir consists of N nodes. The input
and the reservoir connections are given by the vector ω and
the matrix Ω, respectively. The reservoir states and the constant
are connected to the readout layer using the weight matrix
Ψ. (b) A Taylor series ESN with a similar structure to the linear
ESN, but with the Taylor series expansion of tanh for the
transfer functions of the reservoir. (c) A tanh ESN with a
similar structure to the linear ESN, but with tanh nonlinearity in
the transfer functions of the reservoir. Usually a tanh function
is used.

Fig. 4: (a) tanh and its first and second derivatives. (b) Taylor series approximations to tanh. (c) Distance of the Taylor series expansions
to tanh.


The readout weights Ψ need to be trained using a teacher
input-output pair. A popular training technique is to use the
pseudo-inverse method [14]. One drives the ESN with a teacher
input and records the history of the reservoir states into a
matrix X, where the columns correspond to the reservoir nodes
and each row gives the states of all reservoir nodes at one time.
A constant column of 1s is added to X to serve as a bias. The
corresponding teacher output will be denoted by the column
vector y. The readout can be calculated as follows:

$$\Psi' = (X'X)^{-1} X' y, \tag{3}$$

where ' indicates the transpose of a matrix. Figures 3a and 3c
show the architecture of ESNs with linear and tanh activation
functions, respectively. Figure 3b shows the architecture of an
ESN with the Taylor series approximation of tanh as transfer
function. In the next section, we will describe how we will
use these approximations to systematically study the transfer
function nonlinearity in the reservoir.
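
The readout training described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it solves the least-squares normal equations directly with Gaussian elimination instead of calling a pseudo-inverse routine, and it assumes that the state matrix with its bias column has full column rank. The helper names `solve_linear`, `train_readout`, and `readout` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def train_readout(states, targets):
    """Least-squares readout; each row of `states` is the reservoir state at one time."""
    X = [list(row) + [1.0] for row in states]  # append the constant bias column
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, targets)) for i in range(k)]
    return solve_linear(XtX, Xty)  # assumes X'X is invertible


def readout(weights, state):
    """Apply trained weights to one state extended by the constant 1."""
    return sum(w * s for w, s in zip(weights, list(state) + [1.0]))
```

For ill-conditioned state matrices a pseudo-inverse or a regularized solver is the more robust choice; the direct solve is used here only to keep the sketch dependency-free.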


B. Transfer Function Nonlinearity 

Our goal is to systematically explore the effect of the
nonlinearity of the reservoir transfer function on the ESN memory
and performance. Figure 4a illustrates the tanh(x) function and
its first and second derivatives, i.e., d tanh(x) and d² tanh(x).
The tanh(x) function is often considered to behave linearly for
|x| < 0.5, and nonlinearly otherwise. However, looking closely
at the curves of d tanh(x) and d² tanh(x), we see that the only
place where tanh(x) behaves linearly (constant d tanh(x))
is when x → 0. As x increases in magnitude, the first derivative
changes at an increasing rate, i.e., d² tanh(x) is steep,
until |x| = 0.65. This observation suggests that the so-called
linear region of the tanh(x) function is where the function becomes
highly nonlinear very quickly as x increases.
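
The observation about the derivatives can be checked numerically. The sketch below uses the closed forms d tanh(x) = 1 - tanh²(x) and d² tanh(x) = -2 tanh(x)(1 - tanh²(x)); the grid range [0, 2] and its resolution are arbitrary choices made for this illustration.

```python
import math


def dtanh(x):
    """First derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def d2tanh(x):
    """Second derivative of tanh: -2 tanh(x) (1 - tanh(x)^2)."""
    t = math.tanh(x)
    return -2.0 * t * (1.0 - t * t)


# Scan x in [0, 2] on a fine grid and locate where |d2tanh| is largest.
grid = [i / 10000.0 for i in range(20001)]
x_peak = max(grid, key=lambda x: abs(d2tanh(x)))

# Relative loss of slope at the edge of the so-called linear region |x| < 0.5.
slope_drop_at_half = 1.0 - dtanh(0.5)
```

The magnitude of the second derivative peaks near |x| ≈ 0.66, and the slope has already dropped by roughly a fifth at |x| = 0.5, which is consistent with the claim that the nominally linear region is where the curvature builds up fastest.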

We would like to decompose the nonlinearity of tanh
and study how much each additional degree of nonlinearity
affects the performance of the ESN. To this end, we use the
Taylor series expansion of the tanh function around x = 0 to
systematically interpolate the orders of nonlinearity between
the linear transfer function and the tanh transfer function. We will
replace the tanh transfer function with the transfer functions
that we obtain by writing the tanh Taylor series to m terms,
denoted by ■%,,. Table I lists the first few expansions as well
as the exact Taylor series for tanh, and Figure 3b illustrates the
architecture of the ESN with Taylor series expansions as the
reservoir transfer function.
TABLE I: Example of Taylor series expansions of tanh(x) with
different orders m. Here, B_{2m} is the number at position 2m
in the Bernoulli sequence.
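
As an illustration of the expansions this table refers to, the coefficients can be generated from the standard Maclaurin series tanh(x) = Σ_{n≥1} 2^{2n}(2^{2n} - 1) B_{2n} x^{2n-1} / (2n)!. The sketch below computes them exactly with rational arithmetic. The function names are hypothetical, and the indexing convention (m counts the number of odd-power terms, so m = 1 is the linear function and m = 2 adds the cubic term) is an assumption based on the surrounding text.

```python
from fractions import Fraction
from math import comb, factorial


def bernoulli(n_max):
    """Return [B_0, ..., B_n_max] as exact fractions (convention B_1 = -1/2)."""
    B = [Fraction(1)]
    for n in range(1, n_max + 1):
        s = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-s / (n + 1))
    return B


def tanh_taylor_coeffs(m):
    """Coefficients of x, x^3, ..., x^(2m-1) in the tanh Maclaurin series."""
    B = bernoulli(2 * m)
    return [
        Fraction(4 ** n * (4 ** n - 1), factorial(2 * n)) * B[2 * n]
        for n in range(1, m + 1)
    ]


def taylor_tanh(x, m):
    """Evaluate the m-term expansion at x (m = 1 is the linear transfer function)."""
    coeffs = tanh_taylor_coeffs(m)
    return sum(float(c) * x ** (2 * k + 1) for k, c in enumerate(coeffs))
```

For example, `tanh_taylor_coeffs(4)` yields the coefficients 1, -1/3, 2/15, and -17/315, so the four-term expansion is x - x³/3 + 2x⁵/15 - 17x⁷/315.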


Figure 4b shows the curves corresponding to the first four
expansions of the tanh(x) function. Although the Taylor series
expansion of tanh(x) converges for |x| < π/2, it is only for |x| < 1
that the lowest order expansions do not rapidly diverge from
tanh(x). Figure 4c shows the root-mean-squared error (RMSE)
between the m-term Taylor expansion and the tanh(x) function
calculated for |x| < 1. With an increasing number of terms in
the expansion, the approximation approaches the true tanh
exponentially (the inset plot). This exponential
behavior suggests that most of the benefits of the tanh nonlinearity
may come from the first few orders of nonlinearity, and this
will help us to interpret the results in the later sections.
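
The error measurement described above can be reproduced approximately with a short script. This is a sketch under assumptions: the RMSE is estimated on a uniform grid over [-1, 1] (the grid and its resolution are not specified in the text), and only the first four expansions are covered, using the well-known coefficients 1, -1/3, 2/15, -17/315.

```python
import math

# First four Maclaurin coefficients of tanh (powers x, x^3, x^5, x^7).
COEFFS = [1.0, -1.0 / 3.0, 2.0 / 15.0, -17.0 / 315.0]


def taylor_tanh(x, m):
    """m-term Taylor expansion of tanh around 0, for 1 <= m <= 4."""
    return sum(c * x ** (2 * k + 1) for k, c in enumerate(COEFFS[:m]))


def rmse_to_tanh(m, x_max=1.0, samples=2001):
    """RMSE between the m-term expansion and tanh on a uniform grid over [-x_max, x_max]."""
    xs = [-x_max + 2.0 * x_max * i / (samples - 1) for i in range(samples)]
    sq = [(taylor_tanh(x, m) - math.tanh(x)) ** 2 for x in xs]
    return math.sqrt(sum(sq) / len(sq))


# One RMSE value per expansion order m = 1, ..., 4.
errors = [rmse_to_tanh(m) for m in range(1, 5)]
```

The values in `errors` decrease with every added term, which matches the qualitative behavior reported for Figure 4c.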

IV. Experiments 

In this section we study the effect of nonlinearity of the
transfer function in ESNs using two parameters, the input
weight coefficient v and the order m of the Taylor series
expansion used as the transfer function. We will evaluate the
performance of ESNs in linear memory capacity, nonlinear
capacity, Mackey-Glass chaotic time series prediction, and
NARMA 10 computation.

To make a fair comparison between systems, we adjust v
and the input signal scaling so that the magnitude of the
reservoir states is less than 1. The next section will give the
details of ESN construction and evaluation.
















A. Reservoir Construction and Evaluation 


To control for the variations that are due to topological
factors, we will use very constrained reservoir architectures. For
the memory task, Mackey-Glass prediction, and NARMA 10
computation we will use the simple cycle reservoir [24]. This
topology compares well with random topology in memory and
signal-processing benchmark performance, while minimizing
the structural variations of the reservoir. In the simple cycle
reservoir, the reservoir is a simple ring topology with uniform
positive weights r. In this topology the weight r determines the
reservoir spectral radius, r = |λ|, and no rescaling of the weight
matrix is needed. In initial experiments, we observed that the
simple cycle is unable to perform the nonlinear capacity task.
For this task we create the reservoir by sampling the Gaussian
orthogonal ensemble (GOE) Gl- The reservoir weight matrix
in this case is given by Ω = A + A', where A is a matrix with
the same dimensionality as Ω whose entries are sampled
from the standard Gaussian distribution 𝒩(0, 1). The reservoir
is then rescaled to have spectral radius λ. The number of
reservoir nodes N is adjusted for each task to get reasonably
good results in a reasonable amount of time. The input weights
are generated by sampling the Bernoulli distribution over
{-1, +1} and multiplying with the input weight coefficient
v. The reservoir nodes are initialized with 0s and a washout
period of 2N is used during training and testing.
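
To make the construction concrete, the following sketch drives a simple cycle reservoir with a Taylor-expanded transfer function. It is an illustration under assumptions rather than the code used in the experiments: the ring direction (node i listens to node i-1), the seeding, and the names `make_input_weights` and `run_scr` are hypothetical, and only expansions up to m = 4 are supported.

```python
import random

# First four Maclaurin coefficients of tanh (powers x, x^3, x^5, x^7).
COEFFS = [1.0, -1.0 / 3.0, 2.0 / 15.0, -17.0 / 315.0]


def taylor_tanh(x, m):
    """m-term Taylor expansion of tanh around 0, for 1 <= m <= 4."""
    return sum(c * x ** (2 * k + 1) for k, c in enumerate(COEFFS[:m]))


def make_input_weights(n_nodes, v, rng):
    """Input weights: random signs from {-1, +1} scaled by the coefficient v."""
    return [v * rng.choice((-1.0, 1.0)) for _ in range(n_nodes)]


def run_scr(inputs, n_nodes, r, v, m, seed=0):
    """Drive a simple cycle reservoir and return the states recorded after the washout."""
    rng = random.Random(seed)
    w_in = make_input_weights(n_nodes, v, rng)
    x = [0.0] * n_nodes          # reservoir nodes are initialized with zeros
    washout = 2 * n_nodes        # discard the first 2N states
    states = []
    for t, u in enumerate(inputs):
        # Ring topology: node i receives r times the previous state of node i-1
        # (index -1 wraps around to the last node).
        x = [taylor_tanh(r * x[i - 1] + w_in[i] * u, m) for i in range(n_nodes)]
        if t >= washout:
            states.append(x)
    return states
```

Because the ring has a single uniform weight r, no eigenvalue computation is needed to set the spectral radius, which is one reason this topology is convenient for controlled comparisons.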


The reservoirs are driven with task-dependent input u_t for
2,000 time steps and the readout weights Ψ are calculated as
described in Section III-A using MATLAB’s pinv() function.
For evaluation, the reservoir state is reinitialized, the
reservoir is driven for another T = 2,000 time steps, and the
output y_t is generated. For brevity, throughout the experiments
section we adopt the subscript notation for the time index, e.g.,
y_t instead of y(t). By convention, the system performance for
computational capacity tasks is evaluated using the capacity
function C_τ, which is the coefficient of determination between
the output y_t and the desired output ŷ_t:

$$C_\tau = \frac{\mathrm{Cov}^2(y_t, \hat{y}_t)}{\mathrm{Var}(y_t)\,\mathrm{Var}(\hat{y}_t)}, \tag{4}$$

where τ is the memory length for the task (see Section IV-B
for details). For the chaotic prediction task, the performance
is evaluated by calculating the normalized mean-squared-error
NMSE as follows:

$$\mathrm{NMSE} = \sqrt{\frac{\frac{1}{T}\sum_{t=0}^{T}(y_t - \hat{y}_t)^2}{\mathrm{Var}(\hat{y}_t)}}, \tag{5}$$

where y_t is the network output and ŷ_t is the desired output.

For all tasks we systematically explore v ∈
{10^-5, ..., 10^-1} with quarter decade increments and
v ∈ {0.2, ..., 0.35} with 0.05 increments. All results are
averaged over 10 runs. We chose this range for v in
preliminary runs in combination with appropriate input
scaling for each task to ensure that the magnitude of reservoir
states is always less than 1.


B. Linear Memory Capacity

The linear memory capacity is a standard measure of
memory in recurrent neural networks. The τ-delay memory
function C_τ measures how long a network can remember
its inputs. These capacities are calculated by summing the
capacity function over τ: C = Σ_τ C_τ. We use 1 < τ < 100 for
our empirical estimations. In these sets of experiments
reservoirs of size N = 50 nodes are driven with a one-dimensional
input drawn from a uniform distribution on [-0.5, 0.5]. We fix
λ = 0.9 for all experiments. The desired output for this task
is defined as:

$$\hat{y}_t = u_{t-\tau}. \tag{6}$$

Figure 5a shows the total linear memory capacity surface as
a function of m and v. Consistent with previous theoretical and
experimental results, the linear memory capacity does not show
any dependency on v for m = 1, i.e., for the linear network.
However, for v > 0.05 and m > 1 we observe a deviation
from linear memory with no dependence on m. Figure 5b
shows the total memory capacity for the tanh transfer function
as a function of v on a linear-log scale, clearly showing
that for v < 0.05 the total memory capacity of the network
equals that of a linear network. Figure 5c shows the total
capacity for v = 0.1 for various m, confirming that for m > 1
the memory capacity does not vary with m, and suggesting
that all the relevant nonlinear characteristics of the network
stemming from tanh can be observed on the second-order
Taylor expansion m = 2.

Fig. 5: (a) Linear memory capacity for different v and m.
(b) For v < 0.05 the memory capacity of the tanh network is
similar to that of a linear network. (c) Increasing the nonlinearity
beyond m = 2 produces no change in the memory capacity of
the network.
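
The evaluation measures of Eqs. (4) and (5) and the total capacity can be sketched as follows. This is a minimal illustration with hypothetical function names. Two details are assumptions rather than facts taken from the text: the error measure takes the square root of the mean squared error divided by the variance of the desired output, and population (not sample) variances are used throughout.

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def capacity(outputs, targets):
    """Squared correlation coefficient between network outputs and desired outputs."""
    mo, mt = mean(outputs), mean(targets)
    cov = mean([(o - mo) * (t - mt) for o, t in zip(outputs, targets)])
    var_o = mean([(o - mo) ** 2 for o in outputs])
    var_t = mean([(t - mt) ** 2 for t in targets])
    return cov * cov / (var_o * var_t)


def total_capacity(capacities_by_delay):
    """Sum the per-delay capacities C_tau into the total capacity C."""
    return sum(capacities_by_delay)


def nrmse(outputs, targets):
    """Root of the mean squared error normalized by the variance of the targets."""
    mt = mean(targets)
    var_t = mean([(t - mt) ** 2 for t in targets])
    mse = mean([(o - t) ** 2 for o, t in zip(outputs, targets)])
    return math.sqrt(mse / var_t)
```

With this normalization, a readout that always outputs the mean of the desired signal scores an error of 1, and any perfectly linearly related output scores a capacity of 1.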


















C. Nonlinear Computation Capacity 


The nonlinear computation capacity measures the ability
of the system to reconstruct a nonlinear function of its past
inputs. Commonly, Legendre polynomials are used to calculate
the nonlinear computation capacity of the reservoir (TT); their
advantage is that Legendre polynomials of different orders
are orthogonal to each other, allowing one to measure the
reservoir’s capacity to compute functions of varying degrees of
nonlinearity independently from each other. These capacities
are calculated by summing the capacity function over τ:
C = Σ_τ C_τ. We use 1 < τ < 100 for our empirical estimations.
In these sets of experiments reservoirs of size N = 50 nodes
are driven with a one-dimensional input drawn from a uniform
distribution on [-1, 1]. We fix λ = 0.1 for all experiments.
We have previously observed that this is the optimal λ for this
task. The desired output of the Legendre polynomial of order
n with delay τ is given by:

$$\hat{y}_t = \frac{1}{2^n}\sum_{k=0}^{n}\binom{n}{k}^{2}(u_{t-\tau}-1)^{n-k}(u_{t-\tau}+1)^{k}. \tag{7}$$

We must point out that unlike f77), here the network has
to reconstruct the output of a single polynomial and not the
product of several polynomials. In this work we only focus
on the case n = 3. For n = 1, the nonlinear capacity measure
reduces to linear memory, and tanh networks are unable to compute
the even orders because of the input-output symmetry.
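
The desired output for this task can be generated directly from the binomial-sum form of the Legendre polynomial, P_n(x) = 2^{-n} Σ_k C(n,k)² (x - 1)^{n-k} (x + 1)^k. The sketch below is illustrative; the function names are hypothetical, and the handling of the first τ time steps (they are simply skipped) is an assumption.

```python
from math import comb


def legendre(n, x):
    """Legendre polynomial P_n(x) evaluated with the binomial-sum formula."""
    total = sum(
        comb(n, k) ** 2 * (x - 1) ** (n - k) * (x + 1) ** k
        for k in range(n + 1)
    )
    return total / 2 ** n


def legendre_targets(inputs, n, tau):
    """Desired outputs P_n(u_{t-tau}) for every t that has a valid delayed input."""
    return [legendre(n, inputs[t - tau]) for t in range(tau, len(inputs))]
```

As a quick check, `legendre(3, 0.5)` agrees with the closed form (5x³ - 3x)/2 = -0.4375, and `legendre(1, x)` reduces to x, which is why the n = 1 case coincides with linear memory.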


Figure 6a shows the total nonlinear capacity surface as a
function of m and v. For v > 0.001 and m > 1 we observe a
deviation from the linear network capacity, with no dependence
on m. Figure 6b shows the nonlinear capacity for the tanh
transfer function as a function of v on a linear-log scale,
clearly showing that for v < 0.001 the nonlinear capacity
of the network equals that of a linear network. Figure 6c
shows the total capacity for v = 0.1 for various m, confirming
that for m > 1 the nonlinear capacity does not vary with
m, suggesting all the relevant nonlinear characteristics of the
network stemming from tanh can be observed on the
second-order Taylor expansion m = 2. We emphasize that we have
used a standard ESN implementation without reservoir bias for
simplicity. Applying a bias to the reservoir drastically changes
the nonlinear capacity and requires a more thorough analysis.


D. Mackey-Glass System Prediction 

The Mackey-Glass system [29] is a delayed differential
equation with chaotic dynamics, commonly used as a
benchmark for chaotic signal prediction. This system is described
by:

$$\frac{dx_t}{dt} = \beta\,\frac{x_{t-\delta}}{1 + x_{t-\delta}^{n}} - \gamma x_t, \tag{8}$$

where β = 0.2, n = 10, and γ = 0.1 are positive constants and
δ = 17 is the feedback delay. The reservoir consists of N = 500
nodes and λ = 0.9. The task is to predict the next τ integration
time steps given x_t. We scaled the time series to the interval
[0, 0.5] before feeding it to the network.
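
A benchmark series of this kind can be generated with a simple numerical integrator. The sketch below uses forward Euler steps, which is a swapped-in choice made for brevity; the text does not say which integrator, step size, or initial history was used, so `dt`, `x0`, and the constant initial history are all assumptions, as are the function names.

```python
def mackey_glass(n_steps, beta=0.2, gamma=0.1, n=10, delay=17.0, dt=0.1, x0=1.2):
    """Integrate the Mackey-Glass equation with forward Euler steps of size dt."""
    lag = int(round(delay / dt))      # feedback delay expressed in integration steps
    history = [x0] * (lag + 1)        # constant initial history (an assumption)
    for _ in range(n_steps):
        x_now = history[-1]
        x_del = history[-1 - lag]     # state one delay period in the past
        dx = beta * x_del / (1.0 + x_del ** n) - gamma * x_now
        history.append(x_now + dt * dx)
    return history[lag + 1:]          # drop the artificial initial history


def rescale(series, lo=0.0, hi=0.5):
    """Linearly map a series onto the interval [lo, hi]."""
    s_min, s_max = min(series), max(series)
    return [lo + (hi - lo) * (s - s_min) / (s_max - s_min) for s in series]
```

A typical use is `rescale(mackey_glass(5000))`, optionally after discarding an initial transient and subsampling to the desired time resolution.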


Figure 7a shows the NRMSE surface as a function of m and
v. For m > 1 we observe a deviation from the linear network
performance with no dependence on m. Figure 7b shows the
performance for the tanh transfer function as a function of v
on a linear-log scale, clearly showing that for v < 0.00075 the
performance of the network equals that of a linear network,
with no improvement for v > 0.1. Figure 7c shows the
performance for v = 0.1 for various m, confirming that for m > 1 the
performance does not vary with m, suggesting all the relevant
nonlinear characteristics of the network stemming from tanh
can be observed on the second-order Taylor expansion m = 2.
In our experiments, we found that although applying a bias
to the reservoir improves its nonlinear capacity, it does not
improve the performance for Mackey-Glass tasks.

Fig. 6: (a) Nonlinear capacity for different v and m. (b) For
v < 0.001 the nonlinear capacity of the tanh network is similar
to that of a linear network. (c) Increasing the nonlinearity beyond
m = 2 produces no change in the nonlinear capacity of the
network.


E. NARMA 10 Computation 

NARMA 10 [24] is a highly nonlinear auto-regressive task
with long lags that is frequently used to assess neural network
performance. This task is given by the following equation:

$$y_t = \alpha y_{t-1} + \beta y_{t-1}\sum_{i=1}^{n} y_{t-i} + \gamma u_{t-n} u_{t-1} + \delta, \tag{9}$$

where n = 10, α = 0.3, β = 0.05, γ = 1.5, and δ = 0.1. The input
u_t is drawn from a uniform distribution in the interval [0, 0.5].
We use reservoir networks of size N = 100 and λ = 0.8.
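
The NARMA 10 recurrence can be generated with a short loop. The sketch below follows the equation and constants given above; the initial condition (the first n outputs are left at zero), the seed, and the series length are assumptions made for the illustration.

```python
import random


def narma10(inputs, n=10, alpha=0.3, beta=0.05, gamma=1.5, delta=0.1):
    """Generate the NARMA 10 target series for a given input series."""
    y = [0.0] * len(inputs)               # the first n outputs are left at zero
    for t in range(n, len(inputs)):
        window = sum(y[t - i] for i in range(1, n + 1))  # sum of the last n outputs
        y[t] = (alpha * y[t - 1]
                + beta * y[t - 1] * window
                + gamma * inputs[t - n] * inputs[t - 1]
                + delta)
    return y


# Example: inputs drawn uniformly from [0, 0.5] with a fixed seed.
rng = random.Random(0)
u = [rng.uniform(0.0, 0.5) for _ in range(200)]
y = narma10(u)
```

The product term u_{t-n} u_{t-1} is what forces the reservoir to retain inputs over ten steps, while the y_{t-1} Σ y_{t-i} term supplies the nonlinearity.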


Figure 8a shows the NRMSE surface as a function of m
and v. For m > 1 we observe a deviation from the linear
network performance, with no dependence on m. Figure 8b
shows the performance for the tanh transfer function as a
function of v on a linear-log scale, clearly showing that for
v < 0.01 the performance of the network equals that of a linear
network, with no improvement for v > 0.01. Figure 8c shows
the performance for v = 0.1 for various m. In this case, because
of the large standard deviation, we cannot conclusively say that
increasing nonlinearity in the transfer function is helpful.

Fig. 7: (a) Mackey-Glass prediction performance for different
v and m. (b) The prediction performance for the tanh network. For
v < 0.00075 the performance of the tanh network is similar to
that of a linear network. (c) Increasing the nonlinearity beyond
m = 2 produces no change in the performance of the
network.

Fig. 8: (a) NARMA 10 performance for different v and m.
(b) The performance for the tanh network. For v < 0.01 the
performance of the tanh network is similar to that of a linear
network. (c) Increasing the nonlinearity beyond m = 2 produces no
change in the performance of the network.

V. Conclusion and outlook 

Nonlinearity of pre-synaptic and dendritic integration plays
an important role in the nonlinear computational ability of
biological neurons. Similarly, nonlinearity of the transfer function
in neural networks is known to increase the capability of the
simple multi-layer perceptron to approximate any function. In
this work, we systematically studied the effect of increasing
nonlinearity on the memory, nonlinear capacity, and the
signal-processing performance of echo state networks (ESN), a class
of efficient recurrent neural networks with state-of-the-art
performance in chaotic signal prediction. We found that the region of
the tanh function usually thought of as linear is actually quite
nonlinear. Moreover, we found that all the nonlinear power of
the tanh transfer function can be produced using its
second-order Taylor approximation. This finding suggests that ESN
performance benefits from the qualitative presence of nonlinearity and not
from the degree to which the transfer function is nonlinear.
How and why small transfer function nonlinearity helps ESNs
will be the subject of our future research.